The identification of functional motifs in temporal gene expression analysis

The identification of transcription factor binding sites is essential to the understanding of the regulation of gene expression and the reconstruction of genetic regulatory networks. The in silico identification of cis-regulatory motifs is challenging due to sequence variability and lack of sufficient data to generate consensus motifs that are of quantitative or even qualitative predictive value. To determine functional motifs in gene expression, we propose a strategy to adopt false discovery rate (FDR) and estimate motif effects to evaluate combinatorial analysis of motif candidates and temporal gene expression data. The method decreases the number of predicted motifs, which can then be confirmed by genetic analysis. To assess the method we used simulated motif/expression data to evaluate parameters. We applied this approach to experimental data for a group of iron responsive genes in Salmonella typhimurium 14028S. The method identified known and potentially new ferric-uptake regulator (Fur) binding sites. In addition, we identified uncharacterized functional motif candidates that correlated with specific patterns of expression. A SAS code for the simulation and analysis gene expression data is available from the first author upon request.


Introduction
Gene expression exhibits temporal and spatial patterns in response to environmental changes and as part of developmental and differentiation processes. The binding of transcription factors (TFs) to regulatory elements of genes controls when and where specific genes will be expressed. The rate of gene transcription is regulated largely by the TFs that bind and affect the affinity of RNA polymerase for the transcription initiation site of the gene. The identification and testing of relevant TF binding sites remains a significant challenge in functional genomics (Tompa et al. 2005).
Traditionally, TF binding sites have been characterized by experimental methods. The availability of complete genome sequences enables us to use computational tools and advanced statistical methods to predict new potential TF binding sites. In addition, recent advances in high throughput gene expression analysis technologies can provide large amounts of detailed expression data. These techniques include DNA microarray (Conway and Schoolnik 2003;Eisen et al. 1998;Spellman et al. 1998), SAGE (serial analysis of gene expression) (Angelastro et al. 2000) and in vivo gene expression using promoter reporters (Anderson et al. 1988;Bjarnason et al. 2003;Blouin K. 1996;Kalir et al. 2001;Setty et al. 2003;Van Dyk et al. 2001;Zaslaver et al. 2004) and in vivo TF binding techniques Braas et al. 2003;Elemento and Tavazoie 2005;Pritsker et al. 2004;Rosenfeld et al. 2005) (Ui et al. 1998).
Thorough the comparison of expression profi les, genes or putative genes can be grouped based on similarity of expression profi les by cluster analysis. Within the same cluster, genes are assumed to be transcriptionally co-regulated, and upstream regions of these co-expressed genes can be searched for shared sequence motifs. High conservation of upstream sequence motifs has lead to the widespread use of multiple alignments to search for conserved upstream nucleotide sequences (Conlon et al. 2003;Eskin and Pevzner 2002;J. van Helden 2000;J. van Helden 1998;Sinha and Tompa 2000) for motif discovery in several eubacterial species (McGuire et al. 2000) and Sacchromyces. cerevisiae (Frederick P. Roth 1998;Hughes et al. 2000). Although the strategies can identify many signifi cant repeats or conserved sequences upstream of the coding region, the statistically signifi cant meaning of the putative motifs is based solely on the frequencies of the nucleotides or patterns against the genome species. It doesn't indicate the probability that the putative motifs are TF binding sites or have biological relevance for gene expression (Caselle et al. 2002;Cora et al. 2004), and these putative TF sites must be confi rmed by wet-bench genetic analysis. Compared with relatively simple bacterial genomes, the TF binding sites in eukaryotes tend to be much shorter and the size of the potential regulatory region much larger, consequently the number of the predicted putative motifs will be greater. Confi rming all putative motifs in all organisms by wet bench experimental analysis becomes challenging. Therefore approaches that would decrease the number of putative sites and effi ciently obtain functional motifs are crucial issues in the in silico analysis of regulatory sites. Using the combined analysis of complete genome information with gene expression data it is possible to identify statistically signifi cant putative motifs. However, the current motif discovery methods enable us to overestimate the putative motifs compared to what we expect to be signifi cant from biological data (Cora et al. 2004;Cora et al. 2005). Using traditional statistical methods, the identifi cation and testing of functional motifs involves multiple comparison tests, and the avoidance of Type I error, where a null hypothesis is incorrectly rejected, can be problematic. Although some researchers have tried to explore analysis techniques to address these issues (Keles et al. 2002;Kessler and Witholt 2001), the present status of research suggests that the exploration and application of the new analysis techniques would be advantageous In this paper, we adopted the method of controlling the false discovery rate (FDR) (Benjamini 1995) to decrease type I error and estimated motif candidate effects with longitudinal model (Wolfi nger et al. 2001). We are interested in identifying putative functional motifs within co-regulated genes derived from temporal expression data. In the current study, we demonstrate that controlling the FDR and motif effect estimation are more appropriate for functional motif detection, and illustrate the strategy via a simulation study and time series gene expression data in Salmonella typhimurium.

Materials and Methods
Defi nition of false of the false discovery rate (FDR) The FDR is the expected proportion of true null hypotheses erroneously rejected out of the total number of null hypotheses rejected (Benjamini 1995). In theory, if R null hypotheses are rejected in multiple comparison tests, V is the number of true null hypotheses erroneously rejected. FDR is defi ned as: Assume that m, the number of multiple comparison tests, are simultaneously tested, there are m null hypotheses H 1 , H 2 , ..., H m on basis of independent test statistics Y 1 , Y 2 ..., Y m , from each Y i , fi guring out corresponding p-values, P 1 , P 2 ..., P m , then denoting the ordered values as P (1) Յ P (2) Յ ... Յ P (m) , P (1) , being the most signifi cant and P (n) the least signifi cant in the usual terminology. The values to control FDR when P (i) are independently distributed are given by the step-up formula: We reject P (1) , P (2) , ..., P (k*); if no such k exists, we reject none. It has been proven that the FDR could be controlled at some level, q (Benjamini 1995). That is, out of k hypotheses rejected, it is expected that the proportion of erroneously rejected hypotheses is not greater than the FDR adjusted p-value.

Analysis of simulated data
The simulated data was generated by Monte Carlo simulation. We simulated 10 promoters that were associated with 50 sequence motifs: 8 functional motifs (two motifs with negative effect and six motifs with positive effects) and 42 nonfunctional motifs. The simulation was run 50 times and the simulated gene structure is shown in Figure 1. We assume that each gene has a conserved expression profi le, three motifs upstream of the gene, and that motif effects are additive. A positive effect indicates that the TF site would work to enhance or activate gene expression, and a negative effect indicates that the TF site works to repress or hinder gene expression.
Here we temporarily ignore non-linear interaction among motifs and assume that the effects of multiple motifs are additive. All combinations between promoters and motifs have random uniform distribution. The simulated parameters are shown in Table 2; the simulated model is as follows: Here, Y i is gene i expression level; G i is the ith gene conserved expression profi le, i = 1,2, ..., 10; Motif jk is the kth motifs additive effects in the jth cluster; j,k are the number of cluster and motifs, respectively. εi is the ith normal random effects.
To check family-wise error rate (FWER), we shuffl ed the motif order against gene expression level 50 times to obtain the permutated data. For the simulated motifs, we tested by t-test for each of 50 motifs in both the simulated data and the permutated data. Under the assumption of unequal variances, the approximate sig statistic is computed as )] [ /( )] + − w n sig is the signifi cant value of statistics; x i is the mean of the ith candidate motif in a cluster and given gene expression experiment; x is the mean of a cluster and given gene expression experiment; n 1 is the number of the ith candidate motif in a given cluster and gene expression experiment; n 2 is the total number of a given cluster and gene expression experiment.
After the 50 tests were ordered by P (i) , the FWER and the FDR were determined as described above.
Analysis of real gene expression data and estimation of motif effect in S. typhmurimum A previous study by our group (Bjarnason et al. 2003) identified iron responsive genes in S. typhimurium by screening a random promoter library in hogh and low iron. Expression profi les for the iron response clones were further organized on the basis of their expression profi le across 11 conditions and 5-8 time points using cluster analysis (Eisen et al. 1998). Cluster analysis arranges genes according to their similarity in patterns of gene expression. Genes previously demonstrated to be repressed by the transcriptional regulator Fur were found within one of the larger clusters. Fur is primary transcriptional regulator involved in the regulation of iron uptake and metabolism.
We took 300 base pairs (bp) of upstream sequences of each gene in this cluster and tried to fi nd sequence patterns from the unaligned DNA sequences. We adopted the Mismatch Tree Algorithm (MITRA) and MEME -approaches to obtain composite regulatory patterns that are groups of monad patterns that occur near each other (Bailey 1999;Eskin and Pevzner 2002). The MITRA found 58 dyad motifs of length 6bp or greater in this set of co-regulated genes. We used unequal variance t-test where a signifi cant t-value is indicative of a putative motif or composite pattern affecting the gene expression in the condition of that time point. The FDR adjusted p-value was computed as described above. and -10 regions. The combination of the three motifs with the basal promoter element was random. The motif effects could be negative, positive or have no effect. In the screened motif candidates we obtained consensus candidates. In order to quantitatively evaluate the motif candidates, we estimate the motif candidates with a longitudinal model. Let the random variable Y ij = Y t (t ij ) denote the gene expression level of ithgene, measured at t ij in each experiment. We then assume that Y ij satisfies Where n i is the number of longitudinal measurements available for the ith gene, and where all error components ε ij are assumed to be independently normally distribution with mean zero and variance σ 2 . The Y ij can be rewritten as All analysis processes were implemented by SAS.

Simulated data
In order to evaluate the different statistical analysis methods, a simulated data set was generated by combining regulatory motifs with basic promoter elements, as illustrated in Figure 1. The false discovery rate (FDR) adjusted p-value, familywise (or experimentwise) error rate (FWER) and comparison-wise error rate (CWER) computed from the t-probabilities in the simulated data set with ten genes and eight functional motifs are plotted in Figure 2A. The fi rst 13 comparisons in the simulated data and fi rst seven comparisons in permutated data are shown in Table 3. From  Fig ure 2A, at very low probabilities of null hypotheses, FDR adjusted p-value, CWER and FWER are very close. With increasing numbers of rejected hypotheses, the FDR adjusted p-value is always lower than the FWER and higher than the CWER. From Table 3, at i = 12, FDR adjusted p-value = 0.2016, FWER = 0.91096, CWER=0.04837, based on t-probability or CWER, 12 motifs are detected, which could be considered "true" functional motifs. Based on FWERϽ0.5 criteria, the FDR adjusted p-value =0.01778, nine functional motifs would be detected, all of the eight true motif in simulated data are in the detected motif list, at i = 8, FWER=0.1478, FDR adjusted p-value = 0.0200, CWER=0.0032. Thus, FDR adjusted p-value Figure 2: A. The plot of FDR adjusted p-value, experiment-wise type I error (FWER)(alpha) and comparison-wise type I error (CWER)(pt) in the simulated data set with ten genes and eight functional motifs and 50 motif-gene expression combinations. The x-axis is the number of hypotheses rejected and the y-axis is he probability level for the different statistical tests. B. The plot of FDR adjusted p-value, experimentwise type I error (FWER) (alpha) and comparisonwise type I error (CWER)(pt) in the shuffl ed simulation data set with ten genes and eight functional motifs and 50 motif-gene expression combinations. The x-axis is the number of hypotheses rejected and the y-axis is he probability level for the different statistical tests. controlled FDR, the FDR is similar to the family wise rate, so in such a situation controlling the FDR adjusted p-value is same as the controlling of FWER. When the number of null hypotheses is less than that of all hypotheses under testing, the FDR adjusted p-value is much smaller than that of FWER.

Permutation data
In order to generate a negative data set, the putative motif and condition-time point associations calculated above were randomly permutated. The FDR adjusted p-value, FWER and CWER were determined and plotted in Figure 2B and also shown in Table 3. Because the relationships among the putative motifs and gene-conditiontime points have been randomized, no null Note 1 : In addition to these two negative and six positive motifs, 42 motifs with no effects were included in the simulated dataset hypotheses should theoretically be rejected. As we can see in Figure 3, when the association between the putative motifs and expression data was shuffl ed, the FWER sharply increased. From Table 2, at i = 2, FDR adjusted p-value =1.00346, FWER = 0.8656, CWER = 0.04014. Based on the CWER criteria, two motifs were tentatively detected which could be considered "true" functional motifs. However from FWER Ͻ 0.5 and the FDR adjusted p-value, nothing of significance was detected. The FDR adjusted p-value larger than one would imply that the number of Type I errors exceed the number of rejected hypothesis. These results illustrate how unreliable the CWER is in multiple comparison tests of motif discovery.

Expression data from an iron-regulated cluster from S. typhimurium
We have previously characterized iron responsive genes in S. typhimurium (Bjarnason et al. 2003). Iron responsive genes were clustered on the basis of their expression profi les across 11 conditions and time points via cluster analysis- (Eisen et al. 1998), and one signifi cant cluster containing known Fur responsive genes was selected for analysis. Fur mediates the majority of transcriptional repression to iron in bacteria (Earhart 1996). We adopted the Mismatch Tree Algorithm (MITRA) (Eskin and Pevzner 2002) to search for composite regulatory patterns in the 300bp sequence upstream of each gene. The MITRA found 58 dyad putative motifs of length 6bp or greater and the unequal variance t-values and their corresponding probabilities were calculated from the time series gene expression experiment. For the 3886 (67 time points by 58 dyad putative motifs) pattern-condition-time point association tests of the genes in the iron regulated cluster, the FDR, CWER and FWER are plotted in Figure 3. The behaviors of the indices are similar to those in Figure 2 Table  4, the minimum of FDR adjusted p-value is 0.0043. It is worth noting that the motif A candidate in the Table 4 is similar to reported Fur motif binding sequences (Earhart 1996).
Graphical representations of the consensus sequences derived from WebLogo (Crooks GE 2002;Schneider and Stephens 1990) are shown  Figure 4. Each logo consists of stacks of wDNA symbols for each position. The overall height of the stack indicates the sequence conservation (nucleotide presence/conservation) at that position, while the height of symbols within the stack indicates the relative frequency of each nucleic acid at that position. The sequence logo provides a visual description of a binding site. The predicted consensus for Motif A matches that of the published Fur consensus site (Earhart 1996).
In addition to the known Fur binding sites in this set of promoters, additional Fur sites are predicted. Motif candidate B and C did not match any known transcription factor binding sites and may represent a new TF binding sites. This potential regulatory motif is currently being investigated experimentally. The estimation of the motif candidates via longitudinal model In order to quantitatively describe the motif candidates, we estimated motif effects with a longitudinal model. Motif effects are defi ned as the motif candidates take effects for their locating genes over time, Table 5 shows motif effects which contain hypothesis tests for the signifi cance of each of the motif and interaction effects which contain hypothesis tests for the interaction between time and motif, and indicates that fi xed effects of the motif candidates and the interactions among motif candidates, time and quadratic time are very signifi cant. Subsequently, the maximum likelihood (ML) and restricted maximum likelihood (REML) and minimum variance quadratic unbiased estimation (MIVQUE0) are used to estimating for all parameters in the longitudinal model. From Table 6, the estimations of the three methods for the parameters are the same, but the standard errors of the ML estimates are less than that of the REML and MIVQUE0, and the estimates and standard error of the REML are the same as that of MIVQUE0.
Further investigation of the estimates for the parameters shows that signifi cant effects seem to be present among the motif candidates A and B, although they have opposite effects. The motif candidate C has the weakest effects (0.0438). There are signifi cant positive interactions between motif candidate B and time effects, and weaker interactions between motifs A and C and time effects. The results also indicate that the interaction of all motif candidates and quadratic time effects are negative and weak, and suggest that motif infl uences gene expression level over time.

Diagrams of functional motifs and predicted promoters
To illustrate the distribution of the predicted motif candidates and the relationship between the functional motifs and promoters, we used BCM Search Launcher (Smith et al. 1996) to predict the position of the promoters. The positions of the potential motifs were mapped upstream of coding region. A motif occurrence is defined as a position in the sequence with a match that has a signifi cant p-value and signifi cant effects for gene expression levels (FDRϽ0.05). The ordering and spacing of all non-overlapping functional motif occurrences and the highest score promoters are shown for each upstream sequence in Figure 5. We fi nd the distribution of the motifs is neither normal nor uniform. We also fi nd that most of the genes are predicted to be regulated by more than one TF binding site, consistent with the control of transcription by comprehensive interactions among the DNA binding sites.

Discussion
The past few years have witnessed a dramatic increase in our knowledge of primary genetic information at the level of genome sequences, which has been complemented by the development of methodologies for genome scale analysis of gene expression. The merging of these two knowledge bases provides an opportunity for rapid in silico analysis of genetic regulation. In principle there are many potential TF binding sites that can exist for any given gene. One of the fundamental challenges is the accurate prediction of TF binding sites and ultimately the estimation and evaluation of their qualitative and quantitative effects on gene expression. In addition to the specifi c TF binding site, contextual information can influence the quantitative effects of a particular site. This information includes surrounding DNA sequence effects (infl uencing such processes as DNA fl exibility and intrinsic curvature), spacing with respect to promoter elements and combinatorial effects of multiple TF elements. These infl uences are not readily predicted from our current understanding of gene regulation and experimental verifi cation is still required for many predicted TF sites. By combining gene expression data with motif prediction and the application of statistical analysis, the number of predicted TF binding sites can be signifi cantly reduced with a greater degree of confi dence (Cora  Cora et al. 2005) Here we have demonstrated that using an adjusted False Discovery Rate (FDR) and estimation of motif effects as a statistical strategy improve the prediction of real relative to false TF binding sites. The combined analysis of motif prediction and gene expression data is complex, involving thousands of multiple comparison tests. Avoidance of type I error and effi ciently identifying functional TF binding sites is not only of theoretical importance but will also reduce the amount of experimental work required for verification. The traditional approach to dealing with multiple comparisons is through the control of family-wise error rate (FWER), rather than controlling the "comparison-wise error rate" (CWER). FWER is the probability of one or more false rejections of true hypotheses, regardless of how many hypotheses are true and what value the parameters of the false hypotheses take. FWER is controlled by strictly setting the specifi c rejection threshold, so that the probability that any of the null hypotheses tested are erroneously rejected is below a specified low level. The false discovery rate (FDR), the expected ratio of erroneous rejections to the number of rejected hypotheses, gives us an alternative choice. In our simulation experiment, as documented by other researches (Dudoit 2003;Reiner et al. 2003;Storey and Tibshirani 2003a;Storey and Tibshirani 2003b), the FDR adjusted p-value is very similar to FWER when the number of null hypotheses is tested. In such a situation, controlling the FDR adjusted p-value is similar to controlling the FWER. Multiple comparison procedures controlling the FDR adjusted p-value are more powerful than the commonly used multiple comparison procedures based on FWER and CWER. FDR is well suited to large multiple comparison problems in which existing procedures lack power, especially for the preliminary identifi cation and tests of functional motifs in large scale gene expression data and bundles of putative motifs.
The identification of putative regulatory motifs is another challenge in this research. The methods for discovering DNA patterns are directly related to the quality of putative motifs and the accuracy of building genetic networks. DNA pattern discovery methods (Alvis Brazma 1998;Eisen et al. 1998;Eskin and Pevzner 2002;J. van Helden 2000;J. van Helden 1998;Szymon M. Kielbasa 2001;Tompa et al. 2005;Zhou Zhu 2002), look at the signifi cant patterns (J. van Helden 1998), monad or spaced dyads (Eskin and Pevzner 2002;J. van Helden 2000;Lars M.Jakt 2001) over the whole genome and are based on nucleotide frequencies and sampling probabilities; each one with its own advantages and disadvantages. Applying pattern discovery to a cluster of genes based on the similarity of their gene expression profi les is more advantageous than the strategy of using the entire genome (Eskin and Pevzner 2002;Hao Li 2002) and upstream DNA sequence multiple alignments (Frederick P. Roth 1998;Hertz and Stormo 1999;Sinha and Tompa 2002). Expression profi le clustering associates genes controlled by a regulatory cascade even if it may involve many different TFs and binding sites (Harmen J. Bussemaker 2001).
To estimate motif effects, we used a longitudinal model. Longitudinal data means when the same measurement is made repeatedly on experimental units over time, inducing correlation in the measurements within an experimental unit. As compared with cross-sectional data analysis, modeling of longitudinal data presents additional diffi culties in that we must specify the time trend of the population mean and the correlation structure of the observations, and how covariates affect both of these. The linear mixed models are extensions of linear regression models for longitudinal data. It contains fi xed and random effects where the random effects are used to model between-subject variation and the correlation induced by this variation; it is an extremely fl exible analysis tool. The estimation of motif effects by longitudinal model analysis that we present provides a method to obtain functional motifs from large scale of gene expression data sets. The gene expression longitudinal data is characterized by repeated observations over time on the same set of genes, and the main feature is that the repeated observations on the same gene tend to be correlated; the longitudinal model gives us an method to overcome the issue.
Identifi cation of TF binding sites remains problematic. Combining gene expression data with motif searching techniques provides improved identifi cation of regulatory sites. In the strategy presented here, the adjusted FDR and estimation of motif effects are demonstrated to provide a balance between false positive and false negative predictions. In the future, we will adopt this technique for genomic expression patterns  (Lee et al. 2004;McCarroll et al. 2004) and control the proportion of false positive (Fernando et al. 2004) to improve the accuracy of functional motifs, these are likely to help us in functional footprinting of the regulatory motif, and the building of genetic networks.